OpenCL (short for Open Computing Language) is an open standard for running compute workloads on many different kinds of compute hardware (e.g CPUs, GPU's, and FPGA's). The OpenCL trademark is held by Apple, and the standard is developed and released by the Khronos group, a non-for-profit organisation that provides a focal point for the development of royalty-free standards such as OpenGL and Vulkan. The OpenCL specification itself is just a document, and can be downloaded from the Khronos website here. It is then the task of compute hardware vendors to produce software implementations of OpenCL that best make use of their compute devices.
Graphics processing units (GPU's) were originally designed to perform the complex calculations for computing the values for pixels in graphical applications such as 3D rendering. As this process is readily parallelisable, the rendering calculations were offloaded to specialised hardware pipelines to be performed in parallel. Eventually this specialised hardware became generalised and programmable, and GPU's became capable of other tasks like performing scientific calculations. Commercial pressure to achieve the best frame rates in games led to GPU designs that incorporate high bandwidth memory and the ability to parallelise calculations over thousands of discrete processing elements. These days GPU's have floating point performance and memory bandwidth that exceeds CPU's by as much as an order of magnitude. Below is a table of the estimated capacity of the compute devices on Setonix.
| Compute device | Peak FP32 processing power (TFLOP/s) |
|---|---|
| AMD EPYC 7763 | 1.8 |
| AMD Radeon Instinct MI250X | 2 x 23.95 |
As with CUDA and HIP, OpenCL is a software framework that provides a way to harness the compute power of modern GPU's. However, it also has the ability to use CPU's and FPGA's as compute devices.
OpenCL is a mature specification that is supported and developed by the Khronos Group which is a member-led consortium of more than 150 companies. Every major computing vendor, such as AMD, NVIDIA, Intel, Xilinx and ARM has an implementation of OpenCL that follows the specification. This means that a code that is developed for one compute device has the potential to run on another compute device, even if it is from another vendor, without significant modification. In supercomputing there is a welcome change in diversity among available hardware options. OpenCL enables users to navigate this diversity without incurring significant costs when moving platforms. For some compute devices OpenCL is the only computing framework available.
While the ability to switch hardware vendors is of significant value to end users, it can be of lesser importance to some hardware vendors, who by nature prefer to devote resources to ensuring the success of their proprietary frameworks. Thus progress on OpenCL implementations tend to lag behind, both in performance and capability, when compared to frameworks like CUDA and HIP. Therefore with OpenCL on some platforms, one can expect to encounter poor to non-existent support from debugging and performance monitoring tools, as well as performance that approaches but seldom exceeds what can be achieved with proprietary frameworks.
OpenCL is a mature standard, and it has first-rate support for consistent math across compute devices, as well as support for large vector types up to 16 elements long. Shared memory support in OpenCL is first rate, and is well placed to make use of System-On-Chip architectures such as the AMD Instinct MI300.
This is sometimes a difficult question to answer. Researchers often have diverse computing needs, in such cases OpenCL is a good fit as it will provide a solid and flexible foundation for your research tools. If you are looking for the best possible performance and vendor lock-in will not be an issue, then it is probably better to look at using vendor-specific tools and frameworks. However if you prefer open standards, and you need consistent results across devices, stability, support for large vector types, and great documentation that comes with a mature implementation, then OpenCL can be a solid foundation for your project. Here are some challenges and benefits in using OpenCL:
Challenges with using OpenCL
Benefits from using OpenCL
An OpenCL implementation is a framework to support running lightweight pieces of code, called kernels, in parallel over the available cores of a compute device. Below is an example kernel to compute the floating point absolute value of a single element in an array of floating point numbers.
__kernel void vec_fabs(
// Memory allocations that are on the compute device
__global float *src,
__global float *dst,
// Number of elements in the memory allocations
int length) {
// Get our position in the array
size_t gid0 = get_global_id(0);
// Get the absolute value of
if (gid0 < length) {
dst[gid0] = fabs(src[gid0]);
}
}
In order to take the absolute value of every element we need to run this kernel at every point in the array. A software thread can be thought of as the execution of a sequence of compute instructions independently from other threads. In that sense a kernel is run in a software thread.
A hardware thread is a pipeline of physical machinery that executes the instructions in a software thread. Compute devices have a number of cores to manage memory and execute software threads. In AMD and OpenCL terminology these cores are called Compute Units. Every compute unit makes available to the OS a number of hardware threads for running kernels. In OpenCL terminology a processing element is synonymous with a hardware thread.
GPU's use a SIMT (Single Instruction Multiple Threads) processing model, where instructions are executed by the Compute Unit over teams of hardware threads that operate in lock-step with each other and in parallel. For AMD GPUs the team is 32-64 threads wide and is called a wavefront. For NVIDIA GPUs a team is usually 32 threads wide and is called a warp. Each hardware thread in a GPU has access to specialised floating point and integer units for perfoming math operations. These specialised units are known as shader cores (AMD) or CUDA cores (NVIDIA). There are many thousands of these units in a GPU and this feature, along with greater memory bandwidth is responsible for the significant performance advantage that GPU's have over CPU's.
The example below a graphical layout of an AMD MI250X GPU processor. Each processor contains two GPU dies; each die contains 8 shader engines; and each shader engine contains ~14 compute units, for a total of 110 Compute Units per die. Every compute unit commands a wavefront of 64 hardware threads, therefore on this processor there are two unique compute devices, each with $110\times64 = 7040$ available hardware threads for use in compute applications.
In CPUs each compute unit also makes available to the OS a number of hardware threads - usually 2-4. These threads are more "independent" than their GPU equivalents and are not constrained to operate in lock-step with each other. CPU hardware threads have access to SIMD vector units to perform vector math operations, however this hardware is only accessed through special vector instructions that the compiler conservatively generates if it deems it is safe to do so. OpenCL has vector data types whose math operations map easily to SIMD vector instructions, thus making it easier to build software that uses SIMD vector units in CPUs.
An OpenCL implementation is a way to run kernel instances on processing elements (hardware threads) as they become available. The implementation also provides the means to upload and download memory to and from compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a Grid and setting its size at kernel launch. Every point in the Grid is called a work-item, and represents a unique invocation of the kernel. A work-item is equivalent to a single kernel invocation. This is much like defining an execution space using nested loops, however with OpenCL there are no guarantees on the order in which work items are completed.
Work-items are executed in teams called work-groups. In the example above, the grid is of global size (10, 8, 2) and each work-group is of size (5,4,1). The the number of work-groups in each dimension is then (2,2,2). Every work-item has access to device memory that it can use exclusively (private memory), access to memory the team can use (local memory), and access to memory that other teams use (global and constant memory). Every kernel invocation or work-item can query its location within the Grid and use that position as a reference to access allocated memory on the compute device at an appropriately calculated offset.
The above concepts form the core ideas surrounding OpenCL. Everything that follows in this course is supporting information on how to prepare compute devices, manage memory, invoke kernels, and how best to use these concepts together to get the best performance out of your compute devices.
In every accelerated application there is the concept of a host computer on which there are one or more compute devices. The host usually has the largest memory space available and the compute device usually has the most compute power and memory bandwidth. This is why we say the application is "accelerated" by the compute device.
At runtime, the host executes the application. During execution kernels are compiled for available compute devices. The host program manages memory allocations on the compute device and executes compiled kernels on the compute device. For instances where the compute device is a CPU the host CPU and the compute device are the same thing.
Accelerated applications follow the same logical progression of steps:
We now discuss the OpenCL components that make these steps possible.
Below is a representation of the core software components that are available to an OpenCL application.
The first is the Platform. This is a software representation of a vendors implementation. A platform provides access to all devices that the platform supports. During device discovery, available platforms are queried first. A platform provides access to one or more compute devices and possibly even a mixture of accelerator devices from the same vendor.
A Device provides a way to query the capabilites of the compute device and provides a foundation to build a Context.
Surrounding the devices is a Context. A Context is a resource manager that keeps track of everything (i.e kernel executions and memory allocations) that are happening on the compute device/s. A Context is constructed on using both a platform and one or more devices on the platform. There are some benefits (such as memory copies) that could be obtained by encapsulating one or more devices under the same context, however this assumes that devices must belong to the same platform - an assumption which may not be true. A simpler and more general design is to create a unique context for every compute device.
Within the control of the Context are Buffers. Buffers are memory allocations managed under the context, and may exist on either the host or the compute device. At runtime memory is migrated to where is needed, but you can have some control over where a buffer "lives".
At runtime, source code for the kernels are collated into Programs. This is repeated for every utilised context. In a subsequent step programs are built for every utilised compute device in a context
Once a context has been created and devices are known, then one can create one or more Command queue/s for each device. A command queue is a place to submit work, such as kernel invocations and memory copies. When multiple command queues are used then it can leverage the asynchronous capabilities of the compute device, such as being able to copy during compute.
A Kernel is a function that is part of a compiled Program. At runtime we set the arguments of a compile kernel and then submit the kernel to a command queue for execution. We can keep track of the status of a command submitted to the command queue using an Event.
In summary we have the following components:
From Wikipedia OpenCL was originally designed by Apple, who developed a proposal to submit to the Khronos group and holds the trademarks to OpenCL. The first specification, OpenCL 1.0, was ratified on November 18, 2008 and the first public release of the standard was on December 2008. Since then a number of different versions of the standard have been released.
Version 1.1 introduced thread safety so that calls to most OpenCL functions from different threads didn't introduce race conditions. If memory allocations in buffers are used to represent 2D and 3D arrays, then Version 1.1 introduced routines to copy rectangular regions of those buffers to and from the host.
Version 1.2 is probably the most significant release of OpenCL. It remained the defacto OpenCL standard for at least 10 years. Abilities such as being able to divide the processing elements of a compute device into sub-devices that share a common cache and offline compilation of kernels were useful. The introduction of math operations that conform to the IEEE754 precision standard meant consistent results across heterogeneous compute architectures.
Version 2.0 introduced support for Shared Virtual Memory (SVM). Implementation of SVM meant we no longer needed to qualify which space (i.e global, local..) a memory allocation belonged to, and memory could be transferred to and from devices transparently to the user. This was too much for some vendors to implement however, and a few vendor implementations remained at 1.2 for a number of years.
Version 2.1 brought the SPIR-V (Standard Portable Intermediate Representation) language to OpenCL. During compilation a open-source compiler can take C or C++ kernel code and emit a compiled program as SPIR-V intermediate code. At runtime this program is loaded by the application and passed to the vendor driver for further compilation to binary code that can run on the compute device. This is a significant advancement, because if a vendor can implement support for SPIR-V then it dramatically reduces the number of intermediate representations the vendor compiler must support. It also offloads support for kernel language advancements to the open source compiler and provides a measure of security against intellectual property theft.
Version 2.2 allowed kernels to be produced using a subset of the C++14 standard. It also updated support for SPIR-V to version 1.2. The combination of shared virtual memory, C++ kernels, and SPIR-V support meant that very few vendors actually succeeded in producing viable implementations of OpenCL 2.2, and OpenCL stagnated for a period of 5 years.
Version 3.0 addressed the issue of stagnation by making Version 1.2 standard and all the other improvements in Version 2.x optional. This gave vendors freedom to implement what they wanted for customers and gave the standard some breathing room. Version 3.0 also introduced a new C++ language for kernels (called C++ for OpenCL) that uses a subset of the C++17 standard. The Clang compiler supports compilation of C++ for OpenCL kernels into SPIR-V format.
Below is a summary of major features implemented with each release:
| Specification | Release year | Specifics |
|---|---|---|
| 1.0 | 2008 | Initial implementation |
| 1.1 | 2010 |
|
| 1.2 | 2011 |
|
| 2.0 | 2013 |
|
| 2.1 | 2015 |
|
| 2.2 | 2015 |
|
| 3.0 | 2020 |
|
All of the major vendors have OpenCL implementations at varying levels of support for the OpenCL specification. The table below shows the latest known level of support for each version of the specification, along with links to the vendor's OpenCL developer page.
| Vendor | 1.2 | 2.0 | 2.1 | 2.2 | 3.0 |
|---|---|---|---|---|---|
| AMD | Y | Y | Y | Some | N |
| Apple | Y | N | N | N | N |
| ARM | Y | Y | Y | N | Y |
| Intel | Y | Y | Y | Some | Y |
| NVIDIA | Y | N | N | N | Y |
| Portable OpenCL | Y | Some | N | N | Y |
Apple was the original vendor for OpenCL and it comes baked into the MacOS operating system. However the company has since moved on to their proprietary framework Metal and they haven't invested in OpenCL beyond specification 1.2. Support for OpenCL is built in to NVIDIA's CUDA toolkit, though after an initial flurry of development activity up to version 1.2, development stalled until version 3.0. Support for OpenCL with AMD is part of the ROCM suite. Intel strongly supports OpenCL development for CPU's and GPU's with its oneAPI toolkit. The CPU implementation in oneAPI also works for AMD CPU's, which is really good! ARM has solid support for OpenCL on its Mali GPU's. The open source POCL (Portable OpenCL) implementation has a CPU implementation as well as support for OpenCL on CUDA and OpenCL on MacOS.
A conformant OpenCL implementation is an implementation of OpenCL that has passed Khronos' test suite. The number of vendors with conformant implementations is an evolving list, click here to see the latest conformant implementations.
The best source of help for OpenCL is Khronos OpenCL registry. There you can find excellent documentation on the latest specification that your vendor supports. Let's download the latest API specification in PDF format and have it ready as reference material.
Download from the Khronos OpenCL registry the latest OpenCL API and C language specifications to your computer.
Just to avoid confusion, there are two compilation steps for OpenCL applications:
During program execution, kernels are combined into programs and the programs are compiled for each compute device using the vendor's kernel compiler. Thankfully, when compiling an OpenCL application prior to execution (Step 1), we don't need to link against every available implementation. We just need to link against a single library file called the Installable Client Driver (ICD) Loader that may be provided by any vendor. The ICD loader has the name (opencl.dll) on Windows and (libOpenCL.so) on Linux. Accompanying the ICD loader are header files (opencl.h for C and opencl.hpp for C++) that live in a directory called CL. This directory must be in the include path of an application at compilation. It is good practice to source an ICD loader (i.e libOpenCL.so) and OpenCL headers directly from the Khronos Github site, because then you will always have the latest up-to-date OpenCL API. Alternatively, you can use a ICD loader provided by the vendor, but it may not be up with the latest OpenCL standard.
In the source program include the OpenCL headers. When using the C API you can just include the file opencl.h like this:
#include <CL/opencl.h>
Or, if you like to use the C++ API you can include the C++ headers, like this:
#include <CL/opencl.hpp>
We will be using the C API for this course, both to maintain compatibility with C and because the documentation for the C API is much better! Compilation of an OpenCL source file on Linux usually takes this form:
<compiler> -I<path to CL directory> -L<path to ICD loader> <source file> -o <destination program> -lOpenCL
When a program launches, the ICD loader looks up a directory to find OpenCL implementations. On Linux this directory is usually
/etc/OpenCL/vendors
However you can change the place to look for implementations by setting the environment variable
OCL_ICD_VENDORS to another directory. Inside the vendors directory are a number of
*.icd files, one for each vendor implementation of OpenCL. The contents of those files are just
the libary path for the vendor's OpenCL implementation. The ICD loader takes care of intercepting all OpenCL
library calls and routing them to that of the appropriate vendor implementation, depending on the compute
device being used. The routing process happens transparently to the user.
At the location hello_devices.cpp is a complete OpenCL application to obtain the size of on-device memory and the maximum Buffer size that is possible within that memory.
cd src/L1_Introduction
| File | Directory |
|---|---|
| ICD loader (libOpenCL.so) | /usr/lib/x86_64-linux-gnu |
| OpenCL C++ headers directory (CL) | /usr/include |
In the Terminal use ls to list the contents of these directories and locate the CL directory in which the OpenCL header files are located. On Linux you can add the location of the CL directory to your CPATH environment variable, and the location of libOpenCL.so to both your LIBRARY and LD_LIBRARY_PATH environment variables. Then you won't need to explicity tell the compiler where the OpenCL resources are. Compile the application source file hello_devices.cpp using the g++ compiler. The compilation command should look like this:
g++ -g -O2 hello_devices.cpp -o hello_devices.exe -lOpenCL
./hello_devices.exe
You should see at least one device printed with the name and memory sizes. Now that you know how to let the compiler know about OpenCL you can use the make command within that directory to compile the example.
!make clean; make; ./hello_devices.exe
rm -rf *.exe CC -g -fopenmp -O2 -I../include hello_devices.cpp -o hello_devices.exe -lOpenCL Platform 0 Device index: 0 name: gfx1035 global memory size: 536 MB Platform 1 Device index: 1 name: NVIDIA GeForce RTX 3060 Laptop GPU global memory size: 6226 MB Platform 2 Device index: 2 name: AMD Ryzen 7 6800H with Radeon Graphics global memory size: 32865 MB Platform 3 Device index: 3 name: Intel(R) FPGA Emulation Device global memory size: 32865 MB
This application is rather rudimentary, however there is a far more sophisticated OpenCL query application called clinfo. You can use it to query a great deal on information on the available devices. Here we use clinfo to query available platforms and devices.
!clinfo -l
Platform #0: AMD Accelerated Parallel Processing `-- Device #0: gfx1035 Platform #1: NVIDIA CUDA `-- Device #0: NVIDIA GeForce RTX 3060 Laptop GPU Platform #2: Intel(R) OpenCL `-- Device #0: AMD Ryzen 7 6800H with Radeon Graphics Platform #3: Intel(R) FPGA Emulation Platform for OpenCL(TM) `-- Device #0: Intel(R) FPGA Emulation Device